The purpose of today’s lab is to introduce tools for sampling from and calculating statistics for different types of distributions in R. The content of the lab will be split into two sections. The first section will focus on binomial distributions and the second section will focus on normal distributions. The Minihacks will test your knowledge of both types of distributions, as well as some distributions that are not discussed in the lab but are nonetheless important.
To quickly navigate to the desired section, click one of the following links:
Imagine you are flipping a coin. If it is a fair coin, you would expect a 50% chance of the coin landing on heads and a 50% chance of the coin landing on heads. However, as shown in the animation below, every heads is not always paired with a tails. Sometimes there is a run of heads and sometimes there is a run of tails. Still, we would probably expect that, overall, there would be the same number of heads as tails. In other words, we would expect that if we flipped a single coin 100 times, the most likely outcome would be 50 heads and 50 tails.
As you might recall from class, our intuition about the outcome of the 100 coin flips can be described in terms of a binomial distribution. Essentially, the binomial distribution describes the theoretical probability of obtaining a certain outcome over a number of trials when (1) the outcome on every trial is binary (e.g., a coin landed on a heads or a tails; a dice was either a 6 or not a 6) and the probability of the outcome on every trial is the same (e.g., the probability of getting a heads on flip 1 is the same as the probability of getting a heads on flip 100).
If we plot the probability distribution of the example above, two things are apparent: (1) the most probable outcome of flipping 100 coins is 50 heads and (2) there are additional outcomes that are, although less probable, also possible. For instance, one could even expect to get 0 heads 0.00000000000000000000000000008% of the time.
If we want to randomly sample trials from a binomial distribution, we can use the rbinom() function in R. The function takes three arguments. The first argument (n) is the number of trials to sample. If we wanted to flip 2 coins 10 times, we would include the argument n = 10. The second argument (size) is the number of events associated with each trial. If we are flipping 2 coins, we would include size = 2. The third argument (prob) is the probability of success on a given trial. If we consider a heads a success and everything else a failure, we would include the argument prob = 1/2. Putting all of that together, we get rbinom(n = 10, size = 2, .5).
rbinom(n = 10, size = 2, prob = 1/2)
## [1] 0 2 1 1 1 1 0 1 1 1
From the results, we can see we got 0 heads on the first toss of our two coins, 2 heads on the second toss our two coins, and 1 head on the third through sixth tosses of our two coins.
How would we change this if we were flipping 3 coins 10 times?
rbinom(n = 10, size = 3, prob = .5)
## [1] 2 2 2 2 2 2 0 2 3 1
What about rolling 5 6-sided dice 10 times where getting a 6 is considered a successful outcome?
rbinom(n = 10, size = 5, prob = 1/6)
## [1] 0 0 0 0 0 1 1 2 1 1
What about pulling an Ace out of 1 deck of cards 100 times?
rbinom(n = 100, size = 1, prob = 1/13)
## [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0
## [36] 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
## [71] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
The function dbinom gives us the probability of getting any one result. It takes four arguments, but we will only concern ourselves with the first three: (1) x - the number of successful outcomes expected, (2) size - the number of events, and (3) prob - the probability of success on a given event.
To get the probability of getting 1 heads by flipping 2 coins, we could run the following code.
dbinom(x = 1, size = 2, prob = .5)
## [1] 0.5
There is a .50 probability of getting 1 heads when you flip 2 coins. We can investigate why this is the case, by looking at the probability of every outcome.
# HT + TH
(.5 * .5) + (.5 * .5)
## [1] 0.5
What about the probability of getting 1 head when you flip 3 coins?
dbinom(x = 1, size = 3, prob = .5)
## [1] 0.375
# HTT + THT + TTH
(.5 * .5 * .5) + (.5 * .5 * .5) + (.5 * .5 * .5)
## [1] 0.375
What’s the probability of drawing 2 aces out of a 1 deck of cards (with replacement)?
dbinom(x = 2, size = 2, prob = 1/13)
## [1] 0.00591716
# AA
(1 / 13) * (1 / 13)
## [1] 0.00591716
What’s the probability of drawing 0 Aces out of a deck of cards (with replacement)?
# method 1
dbinom(x = 0, size = 2, prob = 1/13)
## [1] 0.852071
# method 2
dbinom(x = 2, size = 2, prob = 12/13)
## [1] 0.852071
# 00
(12 / 13) * (12 / 13)
## [1] 0.852071
If we want to calculate the cumulative probability of getting a certain result (i.e., the probability of getting a result equal to or less than what we expect), we would use the function pbinom(). Cumulative probability may not sound important, but it is when you consider that a p-value is the probability of getting a result equal to or more extreme than that observed in the sample. The function pbinom() takes essentially the same arguments as dbinom(), but instead of the first argument being called x it is called q.
Returning to the example from above, if wanted to get the probability of getting 1 or less heads when we flip 2 coins, we would use pbinom(q = 1, size = 2, prob = 1/2).
pbinom(q = 1, size = 2, prob = 1/2)
## [1] 0.75
The result is .75 because there is a .50 probability of getting 1 head (HT = .5 * .5; TH = .5 * .5) and a .25 chance of getting 0 heads (TT = .5 * .5).
What’s the probability of getting 1 or less heads when flipping 10 coins?
pbinom(q = 1, size = 10, prob = 1/2)
## [1] 0.01074219
What’s the probability of getting 1 or less 6s when rolling 1 dice?
pbinom(q = 1, size = 1, prob = 1/6)
## [1] 1
The function pbinom() can also take the argument lower.tail (defaults to TRUE). The argument lower.tail is what specifies what side of the probability distribution we should be testing from. In practical terms, it is what decided that we wanted 1 or less heads rather than greather than 1 heads.
For instance, if we wanted to test the probability of getting greater than 1 heads when flipping 2 coins, we would simple specify lower.tail = FALSE.
pbinom(q = 1, size = 2, prob = 1/2, lower.tail = FALSE)
## [1] 0.25
What’s the probability of getting greater than 3 6s when rolling 9 dice?
pbinom(q = 3, size = 9, prob = 1/6, lower.tail = FALSE)
## [1] 0.04802149
The function qbinom() essentially does the opposite of pbinom. What I mean by that is, instead of taking an outcome (q) and returning the probability of getting that outcome or less, it takes a probability (p) and returns the probability that corresponds to an outcome or less.
For instance if we wanted the value for which there is 100% probability of getting that value or less on 10 coin flips, we would write qbinom(p = 1.00, size = 10, prob = 1/2).
qbinom(p = 1.00, size = 10, prob = 1/2)
## [1] 10
Unsurprisingly, 10 or less heads has a 100% chance of occurring when you flip 10 coins.
With 100 coin flips, what is the number of heads (or less) that has a .50 probability of occurring?
qbinom(p = .50, size = 100, prob = 1/2)
## [1] 50
With 100 coin flips, what is the number of heads (or less) that has a .25 probability of occurring?
qbinom(p = .25, size = 100, prob = 1/2)
## [1] 47
With 100 coin flips, what is the number of heads (or greater) that has a .25 probability of occurring?
qbinom(p = .25, size = 100, prob = 1/2, lower.tail = FALSE)
## [1] 53
Recall from class that a normal distribution is a continuous probability distribution that is defined by a mean (\(\mu\)) and a standard deviation (\(\sigma\)). Whereas the binomial distribution describes the theoretical probability of obtaining a certain outcome over a number of trials when the outcome of every trial is binary, the normal distribution describe the theoretical probability of obtaining a certain outcome on a continuous distribution that has a certain mean (\(\mu\)) and standard deviation (\(\sigma\)).
In order to randomly sample observations from a normal distribution, we use the function rnorm(). Similar to rbinom(), rnorm() takes three arguments: (1) n - the number of observations to sample from the normal distribution, (2) mean - the mean of the normal distribution, and (3) sd - the standard deviation of the normal distribution.
Below we sample 5 values from a normal distribution with a mean of 0 and a sd of 1.
rnorm(n = 5, mean = 0, sd = 1)
## [1] -0.89691455 0.18484918 1.58784533 -1.13037567 -0.08025176
The five values were -0.8969145, 0.1848492, 1.5878453, -1.13037567 and -0.08025176.
mean(c(-0.89691455, 0.18484918, 1.58784533, -1.13037567, -0.08025176))
## [1] -0.06696949
sd(c(-0.89691455, 0.18484918, 1.58784533, -1.13037567, -0.08025176))
## [1] 1.0749
Calculating the mean() and sd() of our 5 numbers can serve as a bit of a sanity check.
If we plot a histogram of the data, it doesn’t look too normally distributed.
Not to worry! As illustrated in the animation below, many samples do not appear normal until samples of sufficient size are taken from the population.
How would you sample 10 observations from a normal distribution with a mean of 100 and a standard deviation of 10?
x <- rnorm(n = 100, mean = 50, sd = 15)
x
## [1] 70.10702 33.26432 72.59808 25.85051 64.74490 42.08073 40.90247
## [8] 31.44694 60.84194 40.86770 78.70571 37.72669 60.77980 45.40782
## [15] 38.96492 68.91429 56.77912 31.54454 64.42119 42.73362 49.73978
## [22] 39.50173 80.14184 39.56021 20.54169 49.77191 65.71358 35.53914
## [29] 10.88095 44.83591 44.69725 49.81173 52.03666 39.80848 40.65800
## [36] 56.30383 40.88204 52.85271 66.08110 40.23969 20.72601 41.70519
## [43] 72.86999 52.62223 85.21151 37.54641 46.89434 69.44872 42.42647
## [50] 48.28229 37.41338 55.13768 69.42520 41.95130 75.69487 41.86698
## [57] 62.14193 43.11261 40.42576 53.27791 15.67171 46.75063 66.48480
## [64] 38.93634 51.66526 46.54303 41.52317 39.90583 54.62883 53.93384
## [71] 70.03018 56.37187 38.78450 81.85908 66.13855 63.98095 46.74410
## [78] 54.21393 41.15392 34.31660 64.93405 69.60571 57.38871 48.88348
## [85] 45.86954 50.38997 78.66571 78.81954 42.40974 62.19493 58.28126
## [92] 39.91096 58.55067 58.07785 78.43734 63.38223 52.78941 52.53657
## [99] 57.99779 31.48749
Are these descriptives for this sample what we would expect?
mean(x)
## [1] 51.26711
sd(x)
## [1] 15.20467
The normal distribution counterpart of dbinom is dnorm. Similar to rnorm, it takes a mean (mean) and a standard deviation (sd) argument, but instead of an argument for the number of observations (n) you provide it a value (x). As mentioned in class, the probability of any one value in a normal distribution is 0.00 because the total probability of the distribution is 1.00 and there are infinite values in any continuous distribution. The reason dnorm exists is mostly mathematical, but we can use it to the height of the probability curve for any value.
dnorm(0, mean = 0, sd = 1)
## [1] 0.3989423
For example, if we enter the value 0, with a mean of 0 and a standard deviation 1, we get 0.3989423
dnorm_plot(0)
As shown in the plot above, the height of the probability curve at a value of 0 is .399.
Likewise, if we calculate the height of the probability plot at an x value of 1, we see the result 0.2419707.
dnorm(1, mean = 0, sd = 1)
## [1] 0.2419707
Looking at the plotted normal distribution, this is also expected.
dnorm_plot(1)
What would be the height of the probability line at a value of -1 when the mean is 0 and the sd is 1?
dnorm(x = -2, mean = 0, sd = 1)
## [1] 0.05399097
Like pbinom(), pnorm tells you the probability of getting a certain value (or less) in a given normal distribution. Once again, you can set the mean and standard deviation of the distribution using mean and sd. Instead of taking its value using x it takes its argument using the q argument.
For example, if wanted to calculate the probability of the shaded region above, we would use pnorm(q = 0, mean = 0, sd = 1).
pnorm(q = 0, mean = 0, sd = 1)
## [1] 0.5
As would be expected, we have a 50% probability of getting a value in the shaded region of the plot. However, for most of us, it is not intuitive to try to convert the shaded region under a curve to a probability value. An easier way to visualize it is through the use of a cumulative probability plot, where each shaded ball represents a .01 probability.
If we were to count every shaded ball on the left of the vertical line below, we would see that there are 50 balls to the left of the red line. In other words, there is a 50% probability of getting a 0 or less in our distribution.
So, what’s the probability of getting 40 or less when the mean is 50 and the standard deviation is 10?
ball_plot(q = 40, mean = 50, sd = 10)
Looking at our cumulative probability plot, it looks like 16 balls are on the left of the dashed red line. We can check this with pnorm().
pnorm(q = 40, mean = 50, sd = 10)
## [1] 0.1586553
Looks like the probability of getting a value of 40 or less when the mean of the normal distribution is 50 and the standard deviation is 10 is slightly less than .16. We can also work through this conceptually. We expect that 68% of values on a normal distribution will fall between plus or minus one standard deviation. In our example, 40 is one standard deviation below the mean. As such, the probability of getting 40 or a value less than 40 would be \(\frac{1 - 0.68}{2} = \frac{.32}{2} = .16\).
What’s the probability of getting a value of 60 or less, when the mean is 30 and the standard deviation is 15?
ball_plot(q = 60, mean = 30, sd = 15)
pnorm(q = 60, mean = 30, sd = 15)
## [1] 0.9772499
((1 - .9545) / 2) + .9545
## [1] 0.97725
What’s the probability of getting a value greater than -5 when the mean is 0 and the sd is 5?
ball_plot(q = -5, mean = 0, sd = 5, lower.tail = FALSE)
pnorm(q = -5, mean = 0, sd = 5, lower.tail = FALSE)
## [1] 0.8413447
Finally, we can use qnorm to get the value for a variable or less that corresponds to a particular probability. Once again, we can set the mean (mean) and standard deviation (sd) of the distribution, but we use p to set the probability.
For example, to get the value or less that has a probability of 0.00135 of occurring in a distribution with a mean of 0 and a standard deviation of 1 is calculated using qnorm(p = .00135, mean = 0, sd = 1).
qnorm(p = .00135, mean = 0, sd = 1)
## [1] -2.999977
As we can see, the value is just about -3.00. This makes sense if we consider that 99.73% of values are between three standard deviations below and above the mean. Therefore, we would expect the probability of getting -3.00 would be \(\frac{1-.9973}{2} = 0.00135\).
What value or less is associated with a .51 probability in a normal distribution with a mean of 100 and a sd of 10?
qnorm(p = .51, mean = 100, sd = 10)
## [1] 100.2507
What value or greater is associated with a .51 probability in a normal distribution with a mean of 100 and a sd of 10?
qnorm(p = .51, mean = 100, sd = 10, lower.tail = FALSE)
## [1] 99.74931
You are welcome to work with a partner or in a small group of 2-3 people. Please feel free to ask the lab leader any questions you might have!
You are playing Dungeons and Dragons and, to the Dungeon Master’s displeasure, you run immediately to the dragon that is meant to be encountered at the end of her carefully-crafted campaign.
5 20-sided dice, and get a 20 on each die. What is the probability of getting this result?dbinom(x = 5, size = 5, prob = 1/20)
## [1] 0.0000003125
2 (i.e., 3 or more; you will want to specify lower.tail = FALSE) 20s she will let you slay the dragon. What is the probability of getting more than 2 20s when rolling 5 20-sided dice?pbinom(q = 2, size = 5, prob = 1/20, lower.tail = FALSE)
## [1] 0.001158125
.10. She acquiesces. What number of 6s or greater is associated with a probability of .10 when rolling 5 20-sided dice?qbinom(p = .10, size = 5, prob = 1/20, lower.tail = FALSE)
## [1] 1
5 cups of coffee per day and 68% of students drinking between 4 and 6 cups of coffee per day (i.e., the distribution has a standard deviation of 1). What is the probability that a randomly selected graduate student will drink 2 or less cups of coffee per day?pnorm(q = 2, mean = 5, sd = 1)
## [1] 0.001349898
1000 graduate students from the distribution. Plot this as a histogram. Does it look normal?hist(rnorm(n = 1000, mean = 5, sd = 1))
10 cups of coffee per day. Is this a realistic estimate?dnorm(x = 9, mean = 5, sd = 1)
## [1] 0.0001338302
0.00. Calculate the probability that a graduate student would drink 10 or more cups a day.pnorm(q = 9, mean = 5, sd = 1, lower.tail = FALSE)
## [1] 0.00003167124
1 and 100. You think of 37 and the magician guesses 37. Assuming the choice of numbers follows a uniform distribution, what is the probability that the magician guessed your number at random? Use dunif to prove your intuition is correct.dunif(x = 37, min = 0, max = 100)
## [1] 0.01
3.00 with 10 degrees of freedom.pchisq(q = 3.00, df = 10, lower.tail = FALSE)
## [1] 0.9814241
3.00 or greater with 10 degrees of freedom from a t-distribution? Google (or use your intuition) to determine how to calculate the cumulative probability from a t-distribution.pt(q = 3, df = 10, lower.tail = FALSE)
## [1] 0.006671828